1 Introduction

The dataset, Electric Vehicle Population Data, contains information on the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) registered through the Washington State Department of Licensing (DOL). The Department of Licensing owns the dataset, so the data is primary and internal. The dataset is published under the Open Database License (ODbL).

The data consists of 17 columns describing characteristics of the electric vehicles registered as of March 21, 2025.

Objective: Identify the characteristics that appear most frequently in the dataset.

Particular objective: Learn to work with missing and zero values as well as with outliers.

1.1 Asking Questions

To analyse the data, the following questions arise:

  • What time interval does the collected data cover?
  • How many registered BEVs and plug-in hybrid electric vehicles are there?
  • How are the electric vehicle types distributed across the model year?
  • How many manufacturers (makes) are registered? Which are the most registered makes?
  • How many models are in the data? Which models have the most registrations?
  • What is the percentage of eligibility by electric type?
  • Which city has the most registrations?

2 Preparing Data for Exploration

2.1 Exploring data

## # A tibble: 5 × 17
##   `VIN (1-10)` County   City    State `Postal Code` `Model Year` Make    Model  
##   <chr>        <chr>    <chr>   <chr>         <dbl>        <dbl> <chr>   <chr>  
## 1 5YJ3E1EBXK   King     Seattle WA            98178         2019 TESLA   MODEL 3
## 2 5YJYGDEE3L   Kitsap   Poulsbo WA            98370         2020 TESLA   MODEL Y
## 3 KM8KRDAF5P   Kitsap   Olalla  WA            98359         2023 HYUNDAI IONIQ 5
## 4 5UXTA6C0XM   Kitsap   Seabeck WA            98380         2021 BMW     X5     
## 5 JTMAB3FV7P   Thurston Rainier WA            98576         2023 TOYOTA  RAV4 P…
## # ℹ 9 more variables: `Electric Vehicle Type` <chr>,
## #   `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
## #   `Electric Range` <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
## #   `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
## #   `2020 Census Tract` <chr>
Data summary
Name Piped data
Number of rows 235692
Number of columns 17
_______________________
Column type frequency:
character 11
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing n_unique empty whitespace
VIN (1-10) 0 13763 0 0
County 3 212 0 0
City 3 788 0 0
State 0 48 0 0
Make 0 46 0 0
Model 0 171 0 0
Electric Vehicle Type 0 2 0 0
Clean Alternative Fuel Vehicle (CAFV) Eligibility 0 3 0 0
Vehicle Location 10 957 0 0
Electric Utility 3 76 0 0
2020 Census Tract 3 2204 0 0

Variable type: numeric

skim_variable n_missing
Postal Code 3
Model Year 0
Electric Range 36
Base MSRP 36
Legislative District 494
DOL Vehicle ID 0

Summary statistics for model year and electric range

##  Electric Range     Model Year  
##  Min.   :  0.00   Min.   :2000  
##  1st Qu.:  0.00   1st Qu.:2020  
##  Median :  0.00   Median :2023  
##  Mean   : 46.26   Mean   :2021  
##  3rd Qu.: 38.00   3rd Qu.:2024  
##  Max.   :337.00   Max.   :2025  
##  NA's   :36
  • Data is structured into 235,692 rows and 17 columns.
  • 11 columns are of type character and 6 are numeric.
  • The data is discrete.
  • From the total of columns, 15 are nominal and only 2 are ordinal: “Model Year” and “Electric Range”.
  • Across all registered vehicles, there are only 13,763 distinct values in the VIN (1-10) column, which holds the first 10 characters of the Vehicle Identification Number (VIN).
  • The data contains missing and zero values.
  • The model year interval for the registered data goes from 2000 to 2025.
  • The registered electric range is between 0 and 337 miles.

2.2 Fixing column names and Selecting data for the analysis

Several column names contain spaces and capital letters. I prefer to work with lower-case variable names, with underscores in place of the spaces.

I choose to work with the following 7 features: “city”, “make”, “model”, “model_year”, “electric_range”, “clean_alternative_fuel_vehicle_cafv_eligibility” and “electric_vehicle_type”. For easier manipulation, I also replace the long names clean_alternative_fuel_vehicle_cafv_eligibility and electric_vehicle_type with the short ones eligibility and electric_type, respectively. In those columns I also shorten the long category labels: Eligibility unknown as battery range has not been researched becomes unknown, Clean Alternative Fuel Vehicle Eligible becomes clean, and Not eligible due to low battery range becomes negative.
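The renaming and relabelling steps can be sketched in base R as follows. This is only an illustration on a three-row toy data frame: the helper `to_snake()` and the lookup vector `short_labels` are hypothetical names, and the original analysis may well have used a package function for the same purpose.

```r
# Toy stand-in for the raw data (assumption: only three of the 17 columns shown).
ev <- data.frame(
  `Model Year` = c(2019, 2023, 2021),
  `Electric Vehicle Type` = c("BEV", "BEV", "PHEV"),  # placeholder values
  `Clean Alternative Fuel Vehicle (CAFV) Eligibility` = c(
    "Clean Alternative Fuel Vehicle Eligible",
    "Eligibility unknown as battery range has not been researched",
    "Not eligible due to low battery range"
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower case, underscores instead of spaces and punctuation.
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # each run of other characters becomes "_"
  gsub("^_+|_+$", "", x)           # trim underscores at both ends
}

names(ev) <- to_snake(names(ev))

# Shorten the two long column names.
names(ev)[names(ev) == "clean_alternative_fuel_vehicle_cafv_eligibility"] <- "eligibility"
names(ev)[names(ev) == "electric_vehicle_type"] <- "electric_type"

# Map the long eligibility labels to the short ones used in the text.
short_labels <- c(
  "Clean Alternative Fuel Vehicle Eligible" = "clean",
  "Eligibility unknown as battery range has not been researched" = "unknown",
  "Not eligible due to low battery range" = "negative"
)
ev$eligibility <- unname(short_labels[ev$eligibility])

print(ev)
```

Using a named lookup vector keeps the mapping in one place, and any label that is not listed becomes NA, which makes unexpected categories easy to spot.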

3 Processing Data

Among the selected features, there is only a moderate correlation between eligibility and model year.

3.1 Distributions by electric vehicle type

In the following, the selected features are grouped according to electric type.

Data summary
Name Piped data
Number of rows 235692
Number of columns 7
_______________________
Column type frequency:
character 4
numeric 2
________________________
Group variables electric_type

Variable type: character

skim_variable electric_type n_missing complete_rate min max empty n_unique whitespace
city battery 2 1 3 24 0 699 0
city hybrid 1 1 3 24 0 531 0
eligibility battery 0 1 5 8 0 3 0
eligibility hybrid 0 1 5 8 0 2 0
make battery 0 1 3 22 0 38 0
make hybrid 0 1 3 20 0 27 0
model battery 0 1 2 24 0 103 0
model hybrid 0 1 2 17 0 72 0

Variable type: numeric

skim_variable electric_type n_missing complete_rate mean sd p0 p25 p50 p75 p100
model_year battery 0 1 2021.64 2.78 2000 2021 2023 2024 2025
model_year hybrid 0 1 2020.53 3.55 2010 2018 2022 2024 2025
electric_range battery 0 1 50.16 93.66 0 0 0 73 337
electric_range hybrid 36 1 31.27 14.60 6 21 30 38 153

The central tendency measures, mean and median (shown as p50), can be read from the table above. The mode of several attributes is given in the following table.

Mode for the selected variables by type
electric_type city eligibility model_year make model electric_range
battery Seattle unknown 2023 TESLA MODEL Y 0
hybrid Seattle clean 2024 TOYOTA VOLT 32
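Base R has no built-in function for the statistical mode, so a table like the one above needs a small helper. The sketch below is one possible way to do it; `stat_mode()` is a hypothetical name, the data frame is a toy example, and ties are resolved in favour of the value that appears first.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}

# Toy data (assumption: not the real registrations).
ev <- data.frame(
  electric_type  = c("battery", "battery", "battery", "hybrid", "hybrid", "hybrid"),
  model_year     = c(2023, 2023, 2021, 2024, 2024, 2018),
  electric_range = c(0, 0, 215, 32, 32, 21)
)

# Mode of each numeric column within each electric type.
modes <- aggregate(
  cbind(model_year, electric_range) ~ electric_type,
  data = ev,
  FUN = stat_mode
)
print(modes)
```

The same helper works for character columns such as city or make, for example `tapply(ev$city, ev$electric_type, stat_mode)` on a data frame that has a city column.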

The model year column has a left-skewed distribution. For the battery type the mode is 2023, the same as the median, and the mean is 2022. The plug-in hybrid type has mode 2024, median 2022 and mean 2021. The skew is visible in the distribution plots, where the mean and median are represented by dashed and dotted lines, respectively. Outliers are observed for the battery type.

For electric range, the plots show right-skewed distributions for both the battery and the plug-in hybrid type. The battery distribution is much more strongly skewed because of an extremely high count at range 0, which is a consequence of how unresearched ranges are recorded (see Section 3.3).

For the battery type the mean is 50, and the median and mode are both 0. The plug-in hybrid type has mean 31, median 30 and mode 32. In the following plot, the mean and median are represented by the dashed and dotted lines, respectively.

Maximum electric range by type
electric_type max.range
battery 337
hybrid 153

3.2 Outliers

3.2.1 Outliers in Model Year

The table and the boxplot below show that the model years between 2000 and 2010 are intermittent; for instance, 2001 and 2004 are not present. The values between 2000 and 2015 are also outliers, since the number of vehicles in those years is very small compared to the remaining years. These outliers can also be seen in the distribution plot for model year.

Count of registered vehicles by model year
model_year n
2025 11176
2024 49044
2023 59893
2022 28958
2021 20615
2020 12265
2019 10974
2018 14368
2017 8570
2016 5306
2015 4661
2014 3407
2013 4230
2012 1490
2011 680
2010 23
2008 22
2003 1
2002 2
2000 7
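The gaps and the small early counts can be checked directly from the table above. The sketch below re-enters the published counts by hand (so the object `counts` is an assumption about how the table is stored), lists the model years that never occur, and computes how small the share of the earliest years is.

```r
# Counts copied from the table above.
counts <- data.frame(
  model_year = c(2025, 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016,
                 2015, 2014, 2013, 2012, 2011, 2010, 2008, 2003, 2002, 2000),
  n = c(11176, 49044, 59893, 28958, 20615, 12265, 10974, 14368, 8570, 5306,
        4661, 3407, 4230, 1490, 680, 23, 22, 1, 2, 7)
)

# Model years in 2000-2025 with no registered vehicle at all.
missing_years <- setdiff(2000:2025, counts$model_year)
print(missing_years)

# Share of each model year in the whole dataset, in percent.
counts$pct <- round(100 * counts$n / sum(counts$n), 2)

# Combined share of the model years before 2011, in percent.
early_share <- 100 * sum(counts$n[counts$model_year < 2011]) / sum(counts$n)
print(round(early_share, 3))
```

The vehicles with a model year before 2011 add up to 55 rows, a very small fraction of the 235,692 registrations.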

3.2.2 Outliers in Electric Range

The boxplots show several outliers for the battery electric vehicle type and a few outliers for the plug-in hybrid electric vehicle type.

3.3 Zero values

In the distribution plot of electric range for the battery type, there is an extremely high count at 0. Recall, however, that electric range was no longer researched for new BEVs because new cars had an electric range of 30 miles or more; in such cases, 0 was recorded as the electric range.

All zero values are associated with the battery electric vehicle type.

Count of vehicles with 0 range by electric type
electric_type Count_0_range
battery 139761

Distribution of electric range across the battery electric type.

Percentages for the 10 most frequently registered ranges.
electric_range n pct_total_range pct_battery_range
0 139761 59.30% 74.7%
215 6403 2.70% 3.4%
238 4262 1.80% 2.3%
220 4057 1.70% 2.2%
84 3699 1.60% 2.0%
291 2365 1.00% 1.3%
208 2318 1.00% 1.2%
210 1836 0.80% 1.0%
75 1773 0.80% 0.9%
322 1719 0.70% 0.9%

There are 139,761 vehicles with 0 range, which is 59% of the whole dataset and 74.7% of the battery vehicles. This considerable amount of zero-range data distorts the statistics for this variable, making its distribution strongly right skewed. By comparison, the percentage of vehicles with an electric range above 190 is extremely low, so these values appear as outliers in the boxplots.
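The two percentage columns differ only in their denominator: the whole dataset for pct_total_range and the battery vehicles alone for pct_battery_range. A minimal sketch on toy data, with column names borrowed from the table above and everything else assumed:

```r
# Toy data (assumption): six battery vehicles and two hybrids.
ev <- data.frame(
  electric_type  = c(rep("battery", 6), rep("hybrid", 2)),
  electric_range = c(0, 0, 0, 215, 215, 238, 32, 21)
)

battery <- ev[ev$electric_type == "battery", ]

# Frequency of each range value among battery vehicles.
tab <- as.data.frame(table(electric_range = battery$electric_range),
                     responseName = "n")

tab$pct_total_range   <- 100 * tab$n / nrow(ev)       # relative to all rows
tab$pct_battery_range <- 100 * tab$n / nrow(battery)  # relative to battery rows
tab <- tab[order(-tab$n), ]
print(tab)

# The same arithmetic with the figures reported in the text.
pct_total_reported <- round(100 * 139761 / 235692, 1)
print(pct_total_reported)
```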

3.3.1 Zero values and Clean Alternative Fuel Vehicle (CAFV) Eligibility

Now I explore the Clean Alternative Fuel Vehicle (CAFV) Eligibility, which I call simply eligibility for short. There are three classes:

## [1] "clean"    "unknown"  "negative"

Where

  • “clean” is Clean Alternative Fuel Vehicle Eligible.
  • “unknown” is Eligibility unknown as battery range has not been researched.
  • “negative” is Not eligible due to low battery range.

With respect to the length of the dataset, the percentage of each class in eligibility is shown in the table.

Percentage of eligibility
eligibility Number.Of.Eligibility Percentage
clean 73317 31%
negative 22614 10%
unknown 139761 59%

From this table, the unknown class has the largest number of entries.

Is there a relation between the zeros in electric range and the unknown category?

## [1] "number of clean eligibility with range 0 : 0"
## [1] "number of  eligibility unknown with range 0 : 139761"
## [1] "number of negative with range 0 : 0"

All 139,761 vehicles that were registered with 0 range were classified with eligibility unknown and type battery.
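A contingency table is a compact way to confirm this kind of one-to-one relation. The sketch below uses toy data, and the object name `cross` is only illustrative.

```r
# Toy data (assumption): zero range occurs only in the unknown class.
ev <- data.frame(
  eligibility    = c("unknown", "unknown", "clean", "clean", "negative"),
  electric_range = c(0, 0, 215, 38, 21)
)

# Rows: eligibility class. Columns: whether the range is exactly 0.
cross <- table(eligibility = ev$eligibility,
               zero_range  = ev$electric_range == 0)
print(cross)
```

If the relation holds, the TRUE column is non-zero only in the unknown row and the FALSE column is zero in that row.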

Mean and Median of the eligibility class: clean
electric_type Median.Clean Mean.Clean
battery 215 199
hybrid 38 41

The mean and median of the negative eligibility class are below 41 for both types of electric vehicle. The unknown eligibility class is registered only for the battery type; there are no unknown registrations for the plug-in hybrid type.

##    model_year  
##  Min.   :2000  
##  1st Qu.:2016  
##  Median :2018  
##  Mean   :2018  
##  3rd Qu.:2019  
##  Max.   :2024

The median of model_year for the clean eligibility class is 2018.

Which model years are associated with these vehicles with 0 range and unknown eligibility?

The plots above show that, for the battery electric type, there are 0 range values for model year 2008. The rest of the vehicles registered with 0 range correspond to model years between 2019 and 2025. The model years 2022, 2023 and 2025 have only 0 range values.

Is there a relation between the number of 0 range values and make?

It seems that several makes have 0 range values associated with them.

Summarizing the exploration of zero values:

  • The zero electric range values correspond to the class Eligibility unknown as battery range has not been researched and to the battery electric vehicle type only. Most of these values were registered between 2019 and 2025.

  • For the battery type with a range different from zero, there are two associated eligibility classes: clean and negative.

  • The clean class has an electric range median of 215 and a median of 2018 for the model year.

  • The negative class has range values below 41.

3.4 Missing values

##           city    eligibility     model_year           make          model 
##              3              0              0              0              0 
## electric_range  electric_type 
##             36              0

The feature electric range also has missing values.

## [1] "percentage_miss_range: 0"

The percentage of missing values in electric_range is about 0.015% (36 of 235,692 rows; the printed value above appears rounded to 0), which is too small to do any harm. For a better understanding of the missing values, however, I explore the columns to see where they occur.

## # A tibble: 1 × 2
##   model_year electric_type
##        <dbl> <chr>        
## 1       2025 hybrid

The 36 missing values correspond to model_year 2025 and to the electric_type plug-in hybrid electric vehicle (PHEV).
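The missing-value checks in this section follow a common pattern: count the NAs per column, express them as a percentage, and look at which groups they fall in. A minimal sketch on toy data (the data frame is an assumption, the figures 36 and 235,692 come from the text):

```r
# Toy data (assumption): two hybrids from 2025 lack an electric range.
ev <- data.frame(
  electric_type  = c("battery", "hybrid", "hybrid", "battery"),
  model_year     = c(2023, 2025, 2025, 2019),
  electric_range = c(0, NA, NA, 220)
)

# Number of missing values per column.
na_counts <- colSums(is.na(ev))
print(na_counts)

# Distinct (model_year, electric_type) combinations that contain the NAs.
na_rows <- unique(ev[is.na(ev$electric_range), c("model_year", "electric_type")])
print(na_rows)

# Percentage of missing ranges with the figures reported in the text.
pct_missing <- 100 * 36 / 235692
print(round(pct_missing))     # whole-number rounding hides the value
print(round(pct_missing, 3))  # three decimals keep it visible
```

Rounding to a whole number is one plausible reason a very small percentage prints as 0, so it helps to keep a few decimals when reporting it.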

3.5 Cleaning data

It seems that the outliers in the model year feature are due to the fact that electric vehicles were not popular during the initial years. Therefore, for my analysis I consider model years from 2011 to 2025.

Next, I replace the zero range values of the battery electric type with 215, the median electric range of the battery vehicles in the clean eligibility class.

## # A tibble: 1 × 1
##   Observations_0_range
##                  <int>
## 1                    0
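The two cleaning steps can be sketched as follows. This is an illustration on toy data, not the original code: the variable names are assumptions, and the median is computed from the battery vehicles in the clean class after the year filter, which may differ from the order used in the original analysis.

```r
# Toy data (assumption): one pre-2011 vehicle and two zero-range battery vehicles.
ev <- data.frame(
  electric_type  = c("battery", "battery", "battery", "battery", "battery",
                     "hybrid", "battery"),
  eligibility    = c("clean", "clean", "clean", "unknown", "unknown",
                     "clean", "clean"),
  model_year     = c(2018, 2019, 2017, 2023, 2024, 2020, 2008),
  electric_range = c(215, 238, 84, 0, 0, 38, 220)
)

# Step 1: keep model years 2011 to 2025.
ev <- ev[ev$model_year >= 2011 & ev$model_year <= 2025, ]

# Step 2: median range of battery vehicles in the clean class.
is_battery   <- ev$electric_type == "battery"
clean_median <- median(ev$electric_range[is_battery & ev$eligibility == "clean"],
                       na.rm = TRUE)

# Step 3: replace zero ranges of battery vehicles with that median.
zero_battery <- is_battery & !is.na(ev$electric_range) & ev$electric_range == 0
ev$electric_range[zero_battery] <- clean_median

# Check: no zero ranges should remain.
remaining_zeros <- sum(ev$electric_range == 0, na.rm = TRUE)
print(remaining_zeros)
```

The explicit `!is.na()` guard keeps missing ranges untouched, so the missing hybrid values discussed in Section 3.4 would not be overwritten by this step.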

Mode for electric range after replacing 0 values
electric_type Electric.Range
battery 215
hybrid 32

Although I have replaced the 0 values with the median range of the clean class, this has not solved the problem of outliers. Replacing the 0 values with the median moves the mode to the right. Outliers are still present because the replaced values make up more than half of the total data. Since there is no correlation between the features, apart from the slight correlation between model year and eligibility, the outliers do not affect the results of the analysis of the other features.

4 Analyzing Data

4.1 How many registered BEVs and plug-in hybrid electric vehicles are there?

electric_type n perc labels
hybrid 48692 0.2066399 21%
battery 186945 0.7933601 79%
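The perc and labels columns follow directly from the two counts. A short sketch of that arithmetic, with the counts copied from the table above and the variable names assumed:

```r
# Counts per electric type, copied from the table above.
counts <- c(hybrid = 48692, battery = 186945)

# Share of each type among the vehicles kept for the analysis.
perc <- counts / sum(counts)
print(round(perc, 7))

# Percentage labels rounded to whole numbers.
labels <- paste0(round(100 * perc), "%")
print(labels)
```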

4.2 How are the electric vehicle types distributed across the model year?

4.3 How many manufacturers (makes) are registered? Which are the most registered makes?

##  [1] "TESLA"                  "HYUNDAI"                "BMW"                   
##  [4] "TOYOTA"                 "NISSAN"                 "KIA"                   
##  [7] "POLESTAR"               "MAZDA"                  "CHEVROLET"             
## [10] "VOLVO"                  "JEEP"                   "FIAT"                  
## [13] "LINCOLN"                "AUDI"                   "DODGE"                 
## [16] "RIVIAN"                 "VOLKSWAGEN"             "FORD"                  
## [19] "HONDA"                  "PORSCHE"                "MITSUBISHI"            
## [22] "LEXUS"                  "JAGUAR"                 "SMART"                 
## [25] "CHRYSLER"               "MERCEDES-BENZ"          "GMC"                   
## [28] "MINI"                   "SUBARU"                 "CADILLAC"              
## [31] "ACURA"                  "LAND ROVER"             "GENESIS"               
## [34] "LUCID"                  "ALFA ROMEO"             "FISKER"                
## [37] "VINFAST"                "BENTLEY"                "MULLEN AUTOMOTIVE INC."
## [40] "BRIGHTDROP"             "TH!NK"                  "LAMBORGHINI"           
## [43] "AZURE DYNAMICS"         "ROLLS-ROYCE"            "RAM"

There are 45 manufacturers (makes).

Makes across electric type

4.3.1 The first 10 manufacturers with highest registration in Washington State for battery type

The first 10 makes with highest battery type registration
make Observations.Make.Battery
TESLA 101037
NISSAN 15532
CHEVROLET 12426
FORD 8874
KIA 8074
RIVIAN 6750
HYUNDAI 6331
VOLKSWAGEN 5976
BMW 3850
AUDI 2462

4.3.2 The first 10 manufacturers with highest registration in Washington State for plug-in hybrid type

First 10 makes with highest plug-in hybrid type registration
make Observations.Make.Plugin
TOYOTA 8077
JEEP 5951
BMW 5797
CHEVROLET 4709
VOLVO 3929
CHRYSLER 3786
FORD 3724
KIA 3271
AUDI 1898
HYUNDAI 1075
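Rankings like the two above can be produced by counting makes within each electric type and keeping the largest counts. The sketch below is one way to do this in base R; `top_makes()` is a hypothetical helper and the data is a toy example, so only the top 2 makes per type are requested.

```r
# Toy data (assumption): a handful of registrations.
ev <- data.frame(
  make = c("TESLA", "TESLA", "TESLA", "NISSAN", "NISSAN", "KIA",
           "TOYOTA", "TOYOTA", "JEEP"),
  electric_type = c(rep("battery", 6), rep("hybrid", 3))
)

# Hypothetical helper: the top_n most registered makes for each electric type.
top_makes <- function(data, top_n = 10) {
  counts <- as.data.frame(
    table(make = data$make, electric_type = data$electric_type),
    responseName = "n",
    stringsAsFactors = FALSE
  )
  counts <- counts[counts$n > 0, ]                           # drop empty combinations
  counts <- counts[order(counts$electric_type, -counts$n), ] # largest first per type
  top <- do.call(rbind, lapply(split(counts, counts$electric_type), head, n = top_n))
  rownames(top) <- NULL
  top
}

top <- top_makes(ev, top_n = 2)
print(top)
```

With the full dataset, `top_makes(ev)` would return up to 10 makes per type, which corresponds to the layout of the two tables above.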

4.4 Which models have the most registrations?

4.5 Eligibility by type

4.6 Which city has the most registrations?

## # A tibble: 48 × 3
## # Groups:   electric_type [2]
##    electric_type city                  n
##    <chr>         <chr>             <int>
##  1 battery       Auburn             2224
##  2 battery       Bainbridge Island  1611
##  3 battery       Bellevue           9963
##  4 battery       Bellingham         2887
##  5 battery       Bonney Lake        1155
##  6 battery       Bothell            6752
##  7 battery       Bremerton          1412
##  8 battery       Burien             1038
##  9 battery       Camas              1635
## 10 battery       Edmonds            1939
## # ℹ 38 more rows

Seattle is the city with the most registered vehicles of both the battery and the plug-in hybrid type. Across the cities of Washington, more battery vehicles are registered than plug-in hybrids.

5 Conclusions